
Non-verbal information in spontaneous speech -- towards a new framework of analysis

Published 6 Mar 2024 in cs.SD, cs.CL, cs.LG, and eess.AS | (2403.03522v2)

Abstract: Non-verbal signals in speech are encoded by prosody and carry information that ranges from conversation action to attitude and emotion. Despite the importance of prosody, the principles that govern prosodic structure are not yet adequately understood. This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals and their association with meaning. The schema interprets surface representations of multi-layered prosodic events. As a first step towards implementation, we present a classification process that disentangles prosodic phenomena of three orders. It relies on fine-tuning a pre-trained speech recognition model, enabling simultaneous multi-class/multi-label detection. It generalizes over a large variety of spontaneous data, performing on a par with, or superior to, human annotation. In addition to providing a standardized formalization of prosody, disentangling prosodic patterns can direct a theory of communication and speech organization. A welcome by-product is an interpretation of prosody that will enhance speech- and language-related technologies.

References (67)
  1. Thein, M.L.: Die Informationelle Struktur Im Englischen: Syntax und Information Als Mittel der Hervorhebung vol. 323. Walter de Gruyter GmbH & Co KG, ??? (2017) Xu et al. [2015] Xu, Y., Lee, A., Prom-On, S., Liu, F.: Explaining the penta model: a reply to arvaniti and ladd. Phonology 32(3), 505–535 (2015) Cole [2015] Cole, J.: Prosody in context: A review. Language, Cognition and Neuroscience 30(1-2), 1–31 (2015) Ladd [2014] Ladd, D.R.: Simultaneous Structure in Phonology vol. 28. OUP Oxford, ??? (2014) Radford et al. [2023] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. [2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Xu, Y., Lee, A., Prom-On, S., Liu, F.: Explaining the penta model: a reply to arvaniti and ladd. Phonology 32(3), 505–535 (2015) Cole [2015] Cole, J.: Prosody in context: A review. Language, Cognition and Neuroscience 30(1-2), 1–31 (2015) Ladd [2014] Ladd, D.R.: Simultaneous Structure in Phonology vol. 28. OUP Oxford, ??? (2014) Radford et al. [2023] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. [2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. 
[1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cole, J.: Prosody in context: A review. Language, Cognition and Neuroscience 30(1-2), 1–31 (2015) Ladd [2014] Ladd, D.R.: Simultaneous Structure in Phonology vol. 28. OUP Oxford, ??? (2014) Radford et al. [2023] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. 
[2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. 
[2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: CARAT: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision. Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in Kate Roberts’s correspondence with Saunders Lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in Welsh and Irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The Handbook of Discourse Analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. Tübingen (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa Barbara corpus of spoken American English. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? 
(2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? 
evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 
257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. 
[2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. 
[2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. 
[2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. 
[2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. 
Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. 
In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). 
IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  3. Cole, J.: Prosody in context: A review. Language, Cognition and Neuroscience 30(1-2), 1–31 (2015)
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Ladd, D.R.: Simultaneous Structure in Phonology vol. 28. OUP Oxford, ??? (2014) Radford et al. [2023] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. [2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR Du Bois et al. [2014] Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press, ??? (2014) Himmelmann et al. [2018] Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. 
Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? 
(2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). 
IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  4. Ladd, D.R.: Simultaneous Structure in Phonology vol. 28. OUP Oxford (2014)
In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018) Halliday [2015] Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. 
Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. 
Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. 
[1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. 
[2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press (2017)
Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter (1984)
Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960)
Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? Evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015)
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  5. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning, pp. 28492–28518 (2023). PMLR
(2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015) Beckman and Pierrehumbert [1986] Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? 
evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 
257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in japanese and english. Phonology 3, 255–309 (1986) Silverman et al. [1992] Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 
867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. 
Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  6. Du Bois, J.W., Schuetze-Coburn, S., Cumming, S., Paolino, D.: Outline of discourse transcription. In: Talking Data, pp. 45–89. Psychology Press (2014)
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? 
(2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. 
Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. 
In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). 
IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. 
In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  7. Himmelmann, N.P., Sandler, M., Strunk, J., Unterladstetter, V.: On the universality of intonational phrases: A cross-linguistic interrater study. Phonology 35(2), 207–245 (2018)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? 
(2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. 
Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. 
In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). 
IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  8. Halliday, M.A.K.: Intonation and Grammar in British English vol. 48. Walter de Gruyter GmbH & Co KG, ??? (2015)
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. 
Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. 
In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). 
IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. 
Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  9. Beckman, M.E., Pierrehumbert, J.B.: Intonational structure in Japanese and English. Phonology 3, 255–309 (1986)
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: Tobi: A standard for labeling english prosody. In: ICSLP, vol. 2, pp. 867–870 (1992) Reed [2009] Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Reed, B.S.: Units of interaction:“intonation phrases” or “turn constructional phrases”. Actes/Proceedings from IDP (Interface Discours & Prosodie), 351–363 (2009) Degand and Simon [2005] Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. 
[2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. 
Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005) Su and Tseng [2018] Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE Hannay and Kroon [2005] Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 
597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. 
Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. 
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Li et al.
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. 
IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  10. Silverman, K.E., Beckman, M.E., Pitrelli, J.F., Ostendorf, M., Wightman, C.W., Price, P., Pierrehumbert, J.B., Hirschberg, J.: ToBI: A standard for labeling English prosody. In: ICSLP, vol. 2, pp. 867–870 (1992)
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. 
Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. 
Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. 
Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  12. Degand, L., Simon, A.C.: Minimal discourse units: Can we define them, and why should we. Proceedings of SEM-05. Connectors, discourse framing and discourse structure: from corpus-based and experimental analyses to discourse theories, Biarritz, 14–15 (2005)
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? 
(1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. 
In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? 
(1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  13. Su, C.-y., Tseng, C.-y.: Perceivable information structure in discourse prosody-detecting prominent prosodic words in spoken discourse using f0 contour. In: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 424–428 (2018). IEEE
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005) Couper-Kuhlen and Selting [2017] Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press, ??? (2017) Jakobson [1984] Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. 
The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Honnibal et al.
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. 
Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023)
Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch).
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  14. Hannay, M., Kroon, C.: Acts and the relationship between discourse and grammar. Functions of language 12(1), 87–124 (2005)
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981 vol. 106. Walter de Gruyter, ??? (1984) Hockett [1960] Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960) Jacobs et al. [2015] Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. [1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  15. Couper-Kuhlen, E., Selting, M.: Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press (2017)
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. 
Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. 
Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? 
(2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
  16. Jakobson, R.: Russian and Slavic Grammar: Studies, 1931-1981, vol. 106. Walter de Gruyter (1984)
  17. Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960)
  18. Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? Evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015)
  19. Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken Japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE
  20. Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present)
  21. Owens, J.: The Arabic grammatical tradition. The Semitic Languages 46 (2013)
  22. Schippers, A.: The Hebrew grammatical tradition. The Semitic Languages, 59–65 (1997)
  23. Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and Cognitive Processes 25(7-9), 905–945 (2010)
  24. Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press (2001)
  25. Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023)
  26. Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PLoS ONE 16(5), 0250969 (2021)
  27. Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010)
  28. Barbosa, P.A.: Prominence- and boundary-related acoustic correlations in Brazilian Portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer
  29. Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in Samoan. Language and Speech 66(1), 175–201 (2023)
  30. Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech, and Language Processing 16(4), 797–811 (2008)
  31. Roll, N., Graham, C., Todd, S.: PSST! Prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023)
  32. Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023)
  33. De Saussure, F.: Cours de Linguistique Générale, vol. 1. Otto Harrassowitz Verlag (1989)
  34. Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953)
  35. Austin, J.L.: How to Do Things with Words, vol. 88. Oxford University Press (1975)
  36. Chomsky, N.: Syntactic Structures. Mouton de Gruyter (2002)
  37. Carroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: A survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998)
  38. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
  39. Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023)
  40. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics
  41. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923 (2017)
  42. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  43. Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
  44. Peng, C., Chen, K., Shou, L., Chen, G.: CARAT: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023)
  45. Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023)
  46. Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005)
  47. Matalon, N.: The camel humps prosodic pattern. Building Categories in Interaction: Linguistic Resources at Work 220, 155 (2021)
  48. Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision
  49. Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press (2024)
  50. Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in Kate Roberts’s correspondence with Saunders Lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005)
  51. Shisha-Halevy, A.: Converbs in Welsh and Irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper
  52. Couper-Kuhlen, E.: Intonation and discourse. The Handbook of Discourse Analysis, 82–104 (2015)
  53. Couper-Kuhlen, E.: An Introduction to English Prosody. Tuebingen (1986)
  54. Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010)
  55. Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544
  56. Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE
  57. Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021)
  58. Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa Barbara Corpus of Spoken American English. CD-ROM. Philadelphia: Linguistic Data Consortium (2000)
  59. Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023)
  60. Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018)
  61. Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020)
  62. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: A fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020)
  63. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017)
  64. Goodwin, C., Heritage, J.: Conversation analysis. Annual Review of Anthropology 19(1), 283–307 (1990)
  65. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
  66. Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: ToBI (Tones and Break Indices) and RaP (Rhythm and Pitch). Corpus Linguistics and Linguistic Theory 8(2), 277–312 (2012)
  67. Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015) Hirose et al. 
[1984] Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. 
Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. 
Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? 
(2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
  17. Hockett, C.F.: The origin of speech. Scientific American 203(3), 88–97 (1960)
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE Glass [1995-present] Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. 
The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. 
arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present) Owens [2013] Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  18. Jacobs, C.L., Yiu, L.K., Watson, D.G., Dell, G.S.: Why are repeated words produced with reduced durations? evidence from inner speech and homophone production. Journal of Memory and Language 84, 37–48 (2015)
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. 
[2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? 
(1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  19. Hirose, K., Fujisaki, H., Yamaguchi, M.: Synthesis by rule of voice fundamental frequency contours of spoken japanese from linguistic information. In: ICASSP’84. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, pp. 597–600 (1984). IEEE
Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review.
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013) Schippers [1997] Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. 
Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. 
[2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. 
[1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  20. Glass, I.: This American Life. Chicago Public Media. [Online]. Available: https://www.thisamericanlife.org/archive (1995-present)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  21. Owens, J.: The arabic grammatical tradition. The Semitic Languages 46 (2013)
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997) Wagner and Watson [2010] Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. 
PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010) Wennerstrom [2001] Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press, ??? (2001) Triantafyllopoulos et al. [2023] Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 
447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. 
Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  22. Schippers, A.: The hebrew grammatical tradition. The Semitic Languages, 59–65 (1997)
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023)
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics
Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023)
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023)
Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005)
Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021)
Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision
Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press (2024)
Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in Kate Roberts’s correspondence with Saunders Lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005)
Shisha-Halevy, A.: Converbs in Welsh and Irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper
Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015)
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  23. Wagner, M., Watson, D.G.: Experimental and theoretical advances in prosody: A review. Language and cognitive processes 25(7-9), 905–945 (2010)
721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. 
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. 
Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. 
Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? 
(2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  24. Wennerstrom, A.: The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press (2001)
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023) Biron et al. [2021] Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021) Rosenberg [2010] Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. 
[2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. 
[2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  25. Triantafyllopoulos, A., Schuller, B.W., İymen, G., Sezgin, M., He, X., Yang, Z., Tzirakis, P., Liu, S., Mertes, S., André, E., et al.: An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE (2023)
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010) Barbosa [2008] Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. 
[2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. 
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Barbosa, P.A.: Prominence-and boundary-related acoustic correlations in brazilian portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer Calhoun et al. [2023] Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? 
(1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. [2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
  26. Biron, T., Baum, D., Freche, D., Matalon, N., Ehrmann, N., Weinreb, E., Biron, D., Moses, E.: Automatic detection of prosodic boundaries in spontaneous speech. PloS one 16(5), 0250969 (2021)
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. 
[2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 
447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. 
Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. 
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  27. Rosenberg, A.: Classification of prosodic events using quantized contour modeling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 721–724 (2010)
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  28. Barbosa, P.A.: Prominence- and boundary-related acoustic correlations in Brazilian Portuguese read and spontaneous speech. In: Proceedings of the Speech Prosody 2008 Conference, pp. 257–260 (2008). Citeseer
[2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in samoan. Language and Speech 66(1), 175–201 (2023) Sridhar et al. 
[2008] Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008) Roll et al. [2023] Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. 
Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. 
[2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023) Wu et al. [2023] Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023) De Saussure [1989] De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. 
Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  29. Calhoun, S., Yan, M., Salanoa, H., Taupi, F., Kruse Va’ai, E.: Focus effects on immediate and delayed recognition of referents in Samoan. Language and Speech 66(1), 175–201 (2023)
[2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989) Hjelmslev and Whitfield [1953] Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 
1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953) Austin [1975] Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? 
(2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
  30. Sridhar, V.K.R., Bangalore, S., Narayanan, S.S.: Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE transactions on audio, speech, and language processing 16(4), 797–811 (2008)
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  31. Roll, N., Graham, C., Todd, S.: Psst! prosodic speech segmentation with transformers. arXiv preprint arXiv:2302.01984 (2023)
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. 
arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. 
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  32. Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., Chang, S.: Uncovering the disentanglement capability in text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1910 (2023)
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Austin, J.L.: How to do Things with Words vol. 88. Oxford university press, ??? (1975) Chomsky [2002] Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Chomsky, N.: Syntactic Structures. Mouton de Gruyter, ??? (2002) Caroll et al. [1998] Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. 
arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. 
[2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). 
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
  33. De Saussure, F.: Cours de Linguistique Générale vol. 1. Otto Harrassowitz Verlag, ??? (1989)
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  34. Hjelmslev, L., Whitfield, F.J.: Prolegomena to a theory of language (1953)
Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. 
arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  35. Austin, J.L.: How to Do Things with Words, vol. 88. Oxford University Press (1975)
De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Caroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998) Pennington et al. [2014] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. 
[2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Behre et al. [2023] Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023) Kim [2014] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. 
arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics Yin et al. [2017] Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017) Bahdanau et al. [2014] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. 
Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  36. Chomsky, N.: Syntactic Structures. Mouton de Gruyter (2002)
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. 
In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. 
Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. 
Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? 
(1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  37. Carroll, J., Briscoe, T., Sanfilippo, A.: Parser evaluation: a survey and a new proposal. In: LREC, vol. 998, pp. 447–454 (1998)
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  38. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. 
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). 
IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  39. Behre, P., Tan, S., Varadharajan, P., Chang, S.: Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition. arXiv preprint arXiv:2301.03819 (2023)
arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). 
IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023)
  40. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014). Association for Computational Linguistics
McAuliffe et al.
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) Driess et al. [2023] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. 
Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) Peng et al. [2023] Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023) Li et al. [2023] Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. 
[Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. 
Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023) Devlin [2005] Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  41. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923 (2017)
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). 
https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  42. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? 
(2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. 
Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  43. Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005) Matalon [2021] Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021) Matalon et al. [Under revision] Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) 
Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). 
IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  44. Peng, C., Chen, K., Shou, L., Chen, G.: Carat: Contrastive feature reconstruction and aggregation for multi-modal multi-label emotion recognition. arXiv preprint arXiv:2312.10201 (2023)
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. 
Under revision (Under revision) Weinrich [2024] Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024) Shisha-Halevy [2005] Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. 
Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. 
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. 
Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. 
Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  45. Li, Y., Du, H., Ni, Y., Zhao, P., Guo, Q., Yuan, F., Zhou, X.: Multi-modality is all you need for transferable recommender systems. arXiv preprint arXiv:2312.09602 (2023)
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). 
IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. 
[2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  46. Devlin, K.: Confronting context effects in intelligence analysis: How can mathematics help. Center for the Study of Language and Information, Stanford University (2005)
      Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021)
      Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision
      Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press (2024)
      Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in Kate Roberts’s correspondence with Saunders Lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005)
      Shisha-Halevy, A.: Converbs in Welsh and Irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper
      Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015)
      Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN (1986)
      Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010)
      Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544
      Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE
      Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021)
      Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa Barbara Corpus of Spoken American English. CD-ROM. Philadelphia: Linguistic Data Consortium (2000)
      Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023)
      Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018)
      Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020)
      Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020)
      McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017)
      Goodwin, C., Heritage, J.: Conversation analysis. Annual Review of Anthropology 19(1), 283–307 (1990)
      Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
      Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: ToBI (Tones and Break Indices) and RaP (Rhythm and Pitch). Corpus Linguistics and Linguistic Theory 8(2), 277–312 (2012)
      Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. 
[2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. 
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  47. Matalon, N.: The camel humps prosodic pattern. Building categories in interaction: Linguistic resources at work 220, 155 (2021)
Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in kate roberts’s correspondence with saunders lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005) Shisha-Halevy [2007] Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Shisha-Halevy, A.: Converbs in welsh and irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper Couper-Kuhlen [2015] Couper-Kuhlen, E.: Intonation and discourse. 
The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015) Couper-Kuhlen [1986] Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. 
[2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986) Selting et al. [2010] Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. 
Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010) Dogil [2003] Dogil, G.: Understanding prosody. 
In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. 
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
  48. Matalon, N., Weinreb, E., Freche, D., Volk, E., Biron, T., Moses, E., Biron, D.: Structure in Conversational Prosody: Evidence for Vocabulary, Semantics and Syntax of Intonation Units. Under revision
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  49. Weinrich, H.: Tempus: The World of Discussion and the World of Narration. Fordham Univ Press, ??? (2024)
Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. 
[2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  50. Shisha-Halevy, A.: Epistolary grammar: Syntactical highlights in Kate Roberts’s correspondence with Saunders Lewis. Journal of Celtic Linguistics 9(1), 83–103 (2005)
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544 Cornille et al. [2022] Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. [2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. 
Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  51. Shisha-Halevy, A.: Converbs in Welsh and Irish. In: 13th International Congress of Celtic Studies, Bonn (2007). Conference Paper
[2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE Cenceschi et al. 
[2021] Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021) Du Bois et al. [2000] Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 
498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. 
Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  52. Couper-Kuhlen, E.: Intonation and discourse. The handbook of discourse analysis, 82–104 (2015)
[2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000) Hashem et al. [2023] Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. 
In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023) Klie et al. [2018] Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 
5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018) Honnibal et al. [2020] Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. 
[2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. 
[2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. 
Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  53. Couper-Kuhlen, E.: An Introduction to English Prosody. TUEBINGEN, ??? (1986)
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020) Hennequin et al. [2020] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. [2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020) McAuliffe et al. 
[2017] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). 
Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  54. Selting, M., Barth-Weingarten, D., Reber, E., Selting, M.: Prosody in interaction. Prosody in Interaction, Amsterdam/Philadelphia, John Benjamins, 3–40 (2010)
  55. Dogil, G.: Understanding prosody. In: Rickheit, G., Herrmann, T., Deutsch, W. (eds.) Psycholinguistics: Ein Internationales Handbuch, pp. 544–565. De Gruyter Mouton, Berlin • New York (2003). https://doi.org/10.1515/9783110114249.4.544
  56. Cornille, T., Wang, F., Bekker, J.: Interactive multi-level prosody control for expressive speech synthesis. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8312–8316 (2022). IEEE
  57. Cenceschi, S., Sbattella, L., Tedesco, R.: Calliope: A multi-dimensional model for the prosodic characterization of information units. Estudios de fonética experimental, 227–245 (2021)
  58. Du Bois, J.W., Chafe, W.L., Meyer, C., Thompson, S.A., Martey, N.: Santa barbara corpus of spoken american english. CD-ROM. Philadelphia: Linguistic Data Consortium (2000)
  59. Hashem, A., Arif, M., Alghamdi, M.: Speech emotion recognition approaches: A systematic review. Speech Communication, 102974 (2023)
arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017) Goodwin and Heritage [1990] Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990) Wolf et al. [2019] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. 
[2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019) Breen et al. [2012] Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012) Belinkov and Glass [2017] Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017) Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
  60. Klie, J.-C., Bugert, M., Boullosa, B., Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018)
  61. Honnibal, M., Montani, I., Landeghem, S.V., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python. Available online. Accessed: 2023-01-14 (2020)
  62. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020)
  63. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: Trainable text-speech alignment using kaldi. In: Interspeech, vol. 2017, pp. 498–502 (2017)
  64. Goodwin, C., Heritage, J.: Conversation analysis. Annual review of anthropology 19(1), 283–307 (1990)
  65. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
  66. Breen, M., Dilley, L.C., Kraemer, J., Gibson, E.: Inter-transcriber reliability for two systems of prosodic annotation: Tobi (tones and break indices) and rap (rhythm and pitch). Corpus linguistics and linguistic theory 8(2), 277–312 (2012)
  67. Belinkov, Y., Glass, J.: Analyzing hidden representations in end-to-end automatic speech recognition systems. Advances in Neural Information Processing Systems 30 (2017)
